Training a Logistic Regression model with Gradient Descent optimization (binary classification)¶

  • input features: $x_1, x_2, \dots, x_n$ and an extra constant $1$ (bias)

  • output feature: $y$ - a probability prediction for being in Class 1 (e.g. being positive for a disease)

  • weights $w_1, w_2, \dots, w_n$ and a weight associated with the bias, $w_0$ (intercept), are calculated in the training phase to minimize a loss function

  • First we calculate the linear combination

$$w^T x + w_0 = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_0$$

  • Then we apply an activation function, typically a sigmoid function

$$y_{pred} = f(w, x) = \sigma(w^T x + w_0), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$
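The two steps can be sketched directly; the weights, bias, and sample below are made-up values, not fitted ones:

```python
import numpy as np

def sigmoid(z):
    # logistic function, maps any real z to (0, 1)
    return 1 / (1 + np.exp(-z))

# hypothetical weights and a single sample with n = 2 features
w = np.array([0.5, -0.25])   # w1, w2
w0 = 0.1                     # bias / intercept
x = np.array([2.0, 4.0])

z = w @ x + w0               # linear combination w^T x + w0
y_pred = sigmoid(z)          # predicted probability of Class 1
print(z, y_pred)
```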

Cross entropy - loss function for binary classification¶

  • $y_{true}$: binary (0-1) vector of the true categories
  • $y_{pred}$: vector of predicted probabilities, $0 \le y_{pred} \le 1$

$$\mathrm{LOSS} = -\frac{1}{n_{samples}} \sum \left[ y_{true} \log(y_{pred}) + (1 - y_{true}) \log(1 - y_{pred}) \right]$$
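A quick check on made-up vectors that this formula agrees with sklearn's `log_loss` (which averages over the samples by default, matching the $1/n_{samples}$ factor):

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.7, 0.6])

# the cross-entropy formula, averaged over the samples
loss_manual = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# sklearn's implementation
loss_sklearn = log_loss(y_true, y_pred)

print(loss_manual, loss_sklearn)
```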

Reminder: $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$
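This identity is easy to verify with a finite-difference estimate at a few arbitrary points:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# compare the analytic derivative sigma(z)*(1 - sigma(z))
# with a forward finite difference at a few points
h = 1e-6
for z in [-2.0, 0.0, 1.5]:
    numeric = (sigmoid(z + h) - sigmoid(z)) / h
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric - analytic) < 1e-5
```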

$$L(w) = -y_{true} \cdot \log(y_{pred}) - (1 - y_{true}) \cdot \log(1 - y_{pred})$$

$$\frac{\partial L(w)}{\partial w_i} = -y_{true} \cdot \frac{1}{y_{pred}} \cdot \underbrace{\sigma(z)}_{y_{pred}} \cdot \underbrace{(1 - \sigma(z))}_{1 - y_{pred}} \cdot x_i - (1 - y_{true}) \cdot \frac{1}{1 - y_{pred}} \cdot (-1) \cdot \underbrace{\sigma(z)}_{y_{pred}} \cdot \underbrace{(1 - \sigma(z))}_{1 - y_{pred}} \cdot x_i$$

$$\frac{\partial L(w)}{\partial w_i} = -y_{true}(1 - y_{pred}) \cdot x_i + (1 - y_{true})\, y_{pred} \cdot x_i = \underline{(y_{pred} - y_{true}) \cdot x_i}$$

$$X = \begin{pmatrix} x_{10} & x_{11} & \cdots & x_{1n} \\ x_{20} & x_{21} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{k0} & x_{k1} & \cdots & x_{kn} \end{pmatrix} \in \mathbb{R}^{k \times (n+1)} \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{pmatrix} \in \mathbb{R}^{k \times 1}$$

  • without per-sample normalization: $\nabla \mathrm{LOSS} = X^T (y_{pred} - y_{true})$
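The matrix form can be checked against the explicit per-sample sum $\sum_i (y_{pred,i} - y_{true,i}) \cdot x_i$ on small made-up data:

```python
import numpy as np

rng = np.random.RandomState(0)
k, ncols = 5, 4                       # 5 samples, 4 columns (bias included)
X = rng.randn(k, ncols)
y_true = rng.randint(0, 2, size=k).astype(float)
w = np.ones(ncols)                    # any weight vector works for the check
y_pred = 1 / (1 + np.exp(-(X @ w)))   # sigmoid predictions

# matrix form of the gradient
grad_matrix = X.T @ (y_pred - y_true)

# explicit per-sample sum of (y_pred_i - y_true_i) * x_i
grad_sum = sum((y_pred[i] - y_true[i]) * X[i] for i in range(k))

print(np.allclose(grad_matrix, grad_sum))
```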
In [27]:
from sklearn.metrics import log_loss

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

%matplotlib inline

names = ["Sample_code_number", "Clump_Thickness", "Uniformity_of_Cell_Size", "Uniformity_of_Cell_Shape",
         "Marginal_Adhesion", "Single_Epithelial_Cell_Size", "Bare_Nuclei", "Bland_Chromatin",
         "Normal_Nucleoli", "Mitoses", "Class"]

#df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",
#                  names=names,
#                  na_values="?")

df = pd.read_csv("breast-cancer-wisconsin.data",
                  names=names,
                  na_values="?")

df["Bare_Nuclei"] = df["Bare_Nuclei"].fillna(df["Bare_Nuclei"].median())
df["Bare_Nuclei"] = df["Bare_Nuclei"].astype('int64')
df = df.set_index("Sample_code_number")

X = df[names[1:-1]].values
y = df[names[-1]].values // 2 - 1   # map the 2/4 class labels to 0/1
X.shape, y.shape
Out[27]:
((699, 9), (699,))
In [28]:
### Prepend the constant 1 (bias) column to X

num_samples = X.shape[0]
num_features = X.shape[1]

X = np.hstack((np.ones((num_samples, 1)), X))
X.shape
Out[28]:
(699, 10)
In [36]:
### Create the necessary functions to
### predict from (X, w) and calculate the gradient from (X, w, y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))
    
def feedforward(X, w):
    
    z = X @ w  
    y_pred = sigmoid(z)
    
    return y_pred

def backprop(X, w, y):

    y_pred = feedforward(X, w)
    delta = y_pred - y

    gradient = X.T @ delta
    
    return gradient
In [50]:
w0 = np.zeros(num_features + 1)

gradient_true = backprop(X, w0, y)
gradient_true
Out[50]:
array([ 108.5, -190. , -488.5, -460. , -356. , -153. , -606.5, -239.5,
       -411. ,  -68.5])

Do a sanity check on the gradient¶

Reminder: in case of a two-variable function: $$\frac{\partial f}{\partial x}(x, y) = \lim_{h \to 0} \frac{f(x+h, y) - f(x, y)}{h} \qquad \frac{\partial f}{\partial y}(x, y) = \lim_{h \to 0} \frac{f(x, y+h) - f(x, y)}{h}$$

In [60]:
h = 0.0000001

w0 = np.zeros(num_features + 1)

y_pred = feedforward(X, w0)
f0 = log_loss(y, y_pred, normalize=False)
print(f0)

idx = 9
w0[idx] += h 

y_pred = feedforward(X, w0)
f1 = log_loss(y, y_pred, normalize=False)
print(f1)

print(f"Gradient approximation:\t{(f1 - f0) / h}")
print(f"True gradient:\t\t{gradient_true[idx]}")
484.50987921140177
484.50987236140645
Gradient approximation:	-68.49995315860724
True gradient:		-68.5

Optimization: Gradient Descent algorithm¶

$$f(x, y) \to \min!$$

$$\left.\begin{aligned} x_{n+1} &= x_n - \alpha \cdot \frac{\partial f}{\partial x}(x_n, y_n) \\ y_{n+1} &= y_n - \alpha \cdot \frac{\partial f}{\partial y}(x_n, y_n) \end{aligned}\right\}$$

$$P_{n+1} = P_n - \alpha \cdot \nabla f(P_n)$$
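The update rule can be illustrated on a simple two-variable quadratic (a made-up example, unrelated to the dataset), where the minimum is known in closed form:

```python
import numpy as np

def grad_f(p):
    # gradient of f(x, y) = (x - 3)**2 + 2 * (y + 1)**2, minimum at (3, -1)
    x, y = p
    return np.array([2 * (x - 3), 4 * (y + 1)])

p = np.zeros(2)       # start at the origin
alpha = 0.1           # learning rate
for _ in range(200):
    p = p - alpha * grad_f(p)   # P_{n+1} = P_n - alpha * grad f(P_n)

print(p)              # converges close to [3, -1]
```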

In [5]:
from IPython.display import Image
Image(filename='fv_3d.png') 
Out[5]:
In [6]:
Image("contours.png")
Out[6]:
In [7]:
Image("grad_desc.png", width=600)
Out[7]:
In [68]:
from tqdm import tqdm

def logreg_fit(X, y, learning_rate=0.001, num_epochs=10):

    loss_hist = []
    w = np.zeros(X.shape[1])
    
    for idx in tqdm(range(num_epochs)):

        gradient = backprop(X, w, y)

        w = w - gradient * learning_rate

        y_pred = feedforward(X, w)
        loss = log_loss(y, y_pred)
        loss_hist.append(loss)
    
    return w, loss_hist
In [74]:
w_opt, loss_hist = logreg_fit(X, y, 1e-4, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3773.60it/s]

Try out different learning rate parameters: smaller, medium, larger ones¶

In [75]:
w_opt, loss_hist = logreg_fit(X, y, 1e-5, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3762.42it/s]
In [76]:
w_opt, loss_hist = logreg_fit(X, y, 1e-4, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3706.56it/s]
In [77]:
w_opt, loss_hist = logreg_fit(X, y, 1e-3, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3760.45it/s]
In [78]:
w_opt, loss_hist = logreg_fit(X, y, 1e-2, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3749.72it/s]
In [121]:
w_opt, loss_hist = logreg_fit(X, y, 1e-1, 50)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
  0%| 0/50 [00:00<?, ?it/s]
/var/folders/sq/_vdvf2nn51nbbtm87hrx368h0000gn/T/ipykernel_98039/3481577907.py:5: RuntimeWarning: overflow encountered in exp
  return 1 / (1 + np.exp(-z))
100%| 50/50 [00:00<00:00, 1575.83it/s]
In [80]:
w_opt, loss_hist = logreg_fit(X, y, 3e-4, 10000)

plt.plot(loss_hist)
plt.xlabel("Number of epochs")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 10000/10000 [00:02<00:00, 3692.98it/s]

Stochastic Gradient Descent (SGD) algorithm¶

$$\mathrm{LOSS}(w) = \sum_{i=1}^{n_{samples}} L_i(w)$$

  • Shuffle the samples randomly
  • Iterate through the sum by batches and update $w$ after every single batch
In [97]:
rs = np.random.RandomState(42)

rs.choice(10, size=10, replace=False)
Out[97]:
array([8, 1, 5, 0, 7, 2, 9, 4, 3, 6])
In [101]:
from tqdm import tqdm


def logreg_sgd_fit(X, y, learning_rate=0.001, num_epochs=10, batch_size=32, random_state=42):
   
    rs = np.random.RandomState(random_state)

    loss_hist = []
    w = np.zeros(X.shape[1])
    num_samples = X.shape[0]
    
    for _ in tqdm(range(num_epochs)):
        
        permutation = rs.choice(num_samples, size=num_samples, replace=False)
        X = X[permutation]
        y = y[permutation]
        
        for idx in range(num_samples // batch_size):

            X_batch = X[idx * batch_size: (idx + 1) * batch_size]
            y_batch = y[idx * batch_size: (idx + 1) * batch_size]

            gradient = backprop(X_batch, w, y_batch)
    
            w = w - gradient * learning_rate
    
            y_pred = feedforward(X, w)
            loss = log_loss(y, y_pred)
            loss_hist.append(loss)
    
    return w, loss_hist
In [103]:
w_opt, loss_hist = logreg_sgd_fit(X, y, 3e-3, 100, batch_size=32)

plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 100/100 [00:00<00:00, 171.38it/s]
In [104]:
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-3, 100, batch_size=32)

plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 100/100 [00:00<00:00, 170.10it/s]
In [120]:
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-1, 100, batch_size=32)

plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")
plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 100/100 [00:00<00:00, 172.22it/s]
In [10]:
### Apply some smoothing to the loss curves
In [113]:
w_opt, loss_hist = logreg_sgd_fit(X, y, 1e-3, 100, batch_size=32)

plt.plot(loss_hist)
plt.xlabel("Number of iterations")
plt.ylabel("Log loss")

loss_smoothed = np.convolve(loss_hist, np.ones(100) / 100, mode="valid")
plt.plot(range(100, len(loss_smoothed) + 100), loss_smoothed, "r-")

plt.title(f"Loss history      final loss: {loss_hist[-1]}");
100%| 100/100 [00:00<00:00, 164.93it/s]
In [118]:
### Compare the results to the sklearn.linear_model implementation

w_opt, loss_hist = logreg_fit(X, y, 3e-4, 100000)
w_opt
100%| 100000/100000 [00:26<00:00, 3776.54it/s]
Out[118]:
array([-9.71454392,  0.53464755,  0.01128227,  0.32376783,  0.23762062,
        0.05832409,  0.42816088,  0.41212863,  0.15824303,  0.53584273])
In [117]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(C=10000)

model.fit(X[:, 1:], y)

model.intercept_, model.coef_
Out[117]:
(array([-9.72392166]),
 array([[0.53527556, 0.010504  , 0.32491021, 0.23780949, 0.0579741 ,
         0.4285776 , 0.41241511, 0.15827911, 0.53749624]]))

Neural network training - backpropagation¶

In [11]:
Image(filename='nn.png')
Out[11]:

Backpropagation algorithm¶

How can we optimize the weights in a general neural network? We need gradients, i.e. the partial derivatives with respect to all of the weights.

Let's try to understand the calculation with a simple example. We have a 2D binary classification problem with 1 hidden layer of $h$ neurons. We need two weight matrices: $W^{(1)}$ of size $3 \times h$ and $W^{(2)}$ of size $(h+1) \times 1$. Forward propagation just consists of matrix-vector multiplications and applying the activation functions:

$$x_0 = 1, \quad x_1, \quad x_2$$

$$z_1^{(1)} = W_{01}^{(1)} + W_{11}^{(1)} x_1 + W_{21}^{(1)} x_2 \qquad z_2^{(1)} = W_{02}^{(1)} + W_{12}^{(1)} x_1 + W_{22}^{(1)} x_2 \qquad \dots \qquad z_h^{(1)} = W_{0h}^{(1)} + W_{1h}^{(1)} x_1 + W_{2h}^{(1)} x_2$$

$$a_0^{(1)} = 1, \quad a_1^{(1)} = \sigma(z_1^{(1)}), \quad a_2^{(1)} = \sigma(z_2^{(1)}), \quad \dots, \quad a_h^{(1)} = \sigma(z_h^{(1)})$$

$$z_1^{(2)} = W_{01}^{(2)} + W_{11}^{(2)} a_1^{(1)} + W_{21}^{(2)} a_2^{(1)} + \dots + W_{h1}^{(2)} a_h^{(1)}$$

$$y_{pred} = a_1^{(2)} = \sigma(z_1^{(2)})$$
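This forward pass can be sketched in numpy; the hidden size `h` and the random weights below are arbitrary assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.RandomState(0)
h = 4                                  # hidden layer size (arbitrary choice)
W1 = rng.randn(3, h) * 0.1             # rows: bias, x1, x2
W2 = rng.randn(h + 1, 1) * 0.1         # rows: bias, a1 .. ah

x = np.array([1.0, 0.5, -0.3])         # x0 = 1 (bias), x1, x2
z1 = x @ W1                            # hidden pre-activations z^(1)
a1 = np.concatenate(([1.0], sigmoid(z1)))   # activations, a0^(1) = 1 prepended
z2 = a1 @ W2                           # output pre-activation z^(2)
y_pred = sigmoid(z2)[0]                # y_pred = a1^(2)
print(y_pred)
```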

The contribution of a single sample to the log loss function:

$$L = -y_{true} \log(y_{pred}) - (1 - y_{true}) \log(1 - y_{pred})$$

We have to calculate the partial derivatives of $L$ with respect to the entries of $W^{(1)}$ and $W^{(2)}$. It is easier to start from the back with the $W^{(2)}$ weights and apply the chain rule:

$$\frac{\partial L}{\partial W_{i1}^{(2)}} = -y_{true} \cdot \frac{1}{y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial W_{i1}^{(2)}} + (1 - y_{true}) \cdot \frac{1}{1 - y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial W_{i1}^{(2)}}$$

$$\frac{\partial L}{\partial W_{i1}^{(2)}} = -y_{true} \cdot \frac{1}{\sigma(z_1^{(2)})} \cdot \sigma(z_1^{(2)})(1 - \sigma(z_1^{(2)})) \cdot a_i^{(1)} + (1 - y_{true}) \cdot \frac{1}{1 - \sigma(z_1^{(2)})} \cdot \sigma(z_1^{(2)})(1 - \sigma(z_1^{(2)})) \cdot a_i^{(1)}$$

$$\frac{\partial L}{\partial W_{i1}^{(2)}} = (y_{pred} - y_{true})\, a_i^{(1)} = \delta_1^{(2)} a_i^{(1)}$$

Now we can get the derivatives w.r.t. the entries of $W^{(1)}$ by repeated application of the chain rule:

$$\frac{\partial L}{\partial W_{ij}^{(1)}} = -y_{true} \cdot \frac{1}{y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial W_{ij}^{(1)}} + (1 - y_{true}) \cdot \frac{1}{1 - y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial W_{ij}^{(1)}} =$$

$$= -y_{true} \cdot \frac{1}{y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial z_j^{(1)}} \cdot \frac{\partial z_j^{(1)}}{\partial W_{ij}^{(1)}} + (1 - y_{true}) \cdot \frac{1}{1 - y_{pred}} \cdot \frac{\partial \sigma(z_1^{(2)})}{\partial z_j^{(1)}} \cdot \frac{\partial z_j^{(1)}}{\partial W_{ij}^{(1)}} =$$

$$= (y_{pred} - y_{true}) \cdot W_{j1}^{(2)} a_j^{(1)} (1 - a_j^{(1)}) \cdot x_i$$

$$\frac{\partial L}{\partial W_{ij}^{(1)}} = \delta_j^{(1)} \cdot x_i$$

Short summary:

  • using a given set of weights we can use feedforward calculation to get the activations and outputs
  • we calculate the $\delta^{(2)} = y_{pred} - y_{true}$ error on the output layer
  • we propagate the errors backwards to calculate the other $\delta$ values: $\delta^{(1)} = W^{(2)} \delta^{(2)} \cdot a^{(1)} \cdot (1 - a^{(1)})$
  • from the deltas we can simply get the partial derivatives: $\frac{\partial L}{\partial W_{ij}^{(1)}} = \delta_j^{(1)} \cdot x_i = \text{(delta on the next layer)} \times \text{(activation on the previous layer)}$
  • then we have the gradient and can apply any gradient-based optimization method
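The summary can be checked numerically on a tiny made-up network: compute the deltas exactly as listed above, form one gradient entry, and compare it to a finite-difference estimate of the loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(x, W1, W2):
    a1 = np.concatenate(([1.0], sigmoid(x @ W1)))   # hidden activations, bias prepended
    return a1, sigmoid(a1 @ W2)[0]

def loss(x, y_true, W1, W2):
    _, y_pred = forward(x, W1, W2)
    return -y_true * np.log(y_pred) - (1 - y_true) * np.log(1 - y_pred)

rng = np.random.RandomState(1)
h = 3                                # hidden layer size (arbitrary)
W1 = rng.randn(3, h) * 0.5
W2 = rng.randn(h + 1, 1) * 0.5
x = np.array([1.0, 0.7, -1.2])       # x0 = 1, x1, x2
y_true = 1.0

# backpropagation: delta on the output layer, then propagate back
a1, y_pred = forward(x, W1, W2)
delta2 = y_pred - y_true                               # delta^(2)
delta1 = (W2[1:, 0] * delta2) * a1[1:] * (1 - a1[1:])  # delta^(1), bias row skipped
grad_W1 = np.outer(x, delta1)                          # dL/dW^(1)_ij = delta_j * x_i

# finite-difference check on one entry of W1
i, j, hstep = 1, 0, 1e-6
W1p = W1.copy(); W1p[i, j] += hstep
numeric = (loss(x, y_true, W1p, W2) - loss(x, y_true, W1, W2)) / hstep
print(numeric, grad_W1[i, j])        # the two should agree closely
```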